Word Embeddings Go to Italy: A Comparison of Models and Training Datasets
Authors
Abstract
In this paper we present preliminary results on the generation of word embeddings for the Italian language. We compare two popular word representation models, word2vec and GloVe, and train them on two datasets with different stylistic properties. We test the generated word embeddings on a word analogy test derived from the one originally proposed for word2vec, adapted to capture some of the linguistic aspects that are specific to Italian. Results show that the tested models are able to create syntactically and semantically meaningful word embeddings despite the higher morphological complexity of Italian with respect to English. Moreover, we found that the stylistic properties of the training dataset play a relevant role in the type of information captured by the produced vectors.
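The word analogy test mentioned above relies on the standard vector-offset method introduced with word2vec: to solve "a is to b as c is to ?", one computes b − a + c and returns the nearest vocabulary vector by cosine similarity. The sketch below illustrates this on a toy hand-made embedding table (the Italian words and their vectors are illustrative assumptions, not vectors from the paper's trained models):

```python
import numpy as np

# Toy embedding table standing in for trained word2vec/GloVe vectors;
# words and values are illustrative only.
embeddings = {
    "re":     np.array([0.9, 0.1, 0.8]),  # "king"
    "uomo":   np.array([0.8, 0.1, 0.1]),  # "man"
    "donna":  np.array([0.1, 0.9, 0.1]),  # "woman"
    "regina": np.array([0.2, 0.9, 0.8]),  # "queen"
    "cane":   np.array([0.5, 0.5, 0.2]),  # distractor ("dog")
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy(a, b, c, vocab):
    """Solve a : b = c : ? via the vector offset b - a + c,
    returning the nearest word by cosine similarity (query words excluded)."""
    target = vocab[b] - vocab[a] + vocab[c]
    candidates = {w: cosine(target, vec)
                  for w, vec in vocab.items() if w not in (a, b, c)}
    return max(candidates, key=candidates.get)

print(analogy("re", "regina", "uomo", embeddings))  # -> donna
```

An analogy item is scored correct when the nearest neighbor of the offset vector is the expected answer; accuracy over many such items is the benchmark score.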
Similar Articles
Sub-Word Similarity based Search for Embeddings: Inducing Rare-Word Embeddings for Word Similarity Tasks and Language Modelling
Training good word embeddings requires large amounts of data. Out-of-vocabulary words will still be encountered at test-time, leaving these words without embeddings. To overcome this lack of embeddings for rare words, existing methods leverage morphological features to generate embeddings. While the existing methods use computationally-intensive rule-based (Soricut and Och, 2015) or tool-based ...
Robust Gram Embeddings
Word embedding models learn vectorial word representations that can be used in a variety of NLP applications. When training data is scarce, these models risk losing their generalization abilities due to the complexity of the models and the overfitting to finite data. We propose a regularized embedding formulation, called Robust Gram (RG), which penalizes overfitting by suppressing the disparity...
Evaluation of word embeddings against cognitive processes: primed reaction times in lexical decision and naming tasks
This work presents a framework for word similarity evaluation grounded on cognitive sciences experimental data. Word pair similarities are compared to reaction times of subjects in large scale lexical decision and naming tasks under semantic priming. Results show that GloVe embeddings lead to significantly higher correlation with experimental measurements than other controlled and off-the-shelf...
Not All Neural Embeddings are Born Equal
Neural language models learn word representations that capture rich linguistic and conceptual information. Here we investigate the embeddings learned by neural machine translation models. We show that translation-based embeddings outperform those learned by cutting-edge monolingual models at single-language tasks requiring knowledge of conceptual similarity and/or syntactic role. The findings s...
Unsupervised Morphological Expansion of Small Datasets for Improving Word Embeddings
We present a language independent, unsupervised method for building word embeddings using morphological expansion of text. Our model handles the problem of data sparsity and yields improved word embeddings by relying on training word embeddings on artificially generated sentences. We evaluate our method using small sized training sets on eleven test sets for the word similarity task across seve...